Project 3¶
Author: AJ Heskett Date: 2025-11-18
part_1_2¶
Part 1¶
The small size of the banana dataset, along with the consistency of its images, meant the model learned an accurate baseline very quickly and did not need much training.
The model was able to track the bananas fairly accurately, but not without error. The owl image has a hallucinated banana box, but the rest work quite well.
The plot shows that the classification error drops in the first few epochs and then converges at a very low value, indicating the model is confidently recognizing bananas in the dataset. The bounding-box MAE steadily decreases as well, suggesting the model is learning to localize bananas accurately. In the sample images, all bananas are detected with high confidence and the bounding boxes localize them well. My own images, however, were not detected as they should be. The model almost seems to detect just the middle third of the image and compute that as the bounding box. I do not know why it does this, but it could be related to the fact that I am not using the bananas dataset for those images.
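The bounding-box MAE in the plot can be understood as the mean absolute difference between predicted and ground-truth corner coordinates. A minimal sketch of that metric, assuming boxes in [x1, y1, x2, y2] form (the function name and example values are illustrative, not taken from the actual training code):

```python
import numpy as np

def bbox_mae(pred, target):
    """Mean absolute error over the four corner coordinates,
    averaged across all boxes. Boxes are [x1, y1, x2, y2]."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.abs(pred - target).mean()

# Illustrative values: predictions slightly off from ground truth.
pred = [[10, 20, 110, 120], [30, 40, 90, 100]]
target = [[12, 18, 108, 122], [30, 44, 94, 100]]
# Per-coordinate errors are 2,2,2,2 and 0,4,4,0, so the mean is 2.0.
```

A steadily falling value of this metric means the predicted corners are, on average, moving closer to the true ones.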
Part 2¶
The main difference between my implementation and PyTorch's is that theirs had a much more effective method for shrinking the bounding boxes to cover the object accurately. I also produced a larger variety of bounding-box sizes and aspect ratios than PyTorch did. The purpose of NMS is to limit the detections to only those with the highest likelihood. This is very effective and can substantially reduce false positives. However, I believe it can also lead to more false negatives: low confidence due to occlusion or noise can cause a model to miss detections it would otherwise find without NMS. Effectively, NMS shrinks your net; you catch less junk, but there is a chance you will miss more fish.
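To make the net analogy concrete, here is a minimal greedy NMS sketch in NumPy. This is not the PyTorch implementation discussed above; the function names and the 0.5 threshold are illustrative. Raising `iou_thresh` keeps more overlapping boxes (a looser net), while lowering it suppresses more aggressively:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop anything that
    overlaps it beyond the threshold, and repeat on what remains."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
# The second box overlaps the first heavily (IoU ~0.68) and is suppressed.
```

The false-negative risk shows up in the scores: a heavily occluded banana scoring 0.3 near a confident 0.9 detection gets suppressed if their boxes overlap, even when both detections were real.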
part_3¶
Part 3: Human–Object Interaction (HOI) Analysis using VLM¶
Perform zero-shot HOI analysis using VLMs on a subset of HICO-DET dataset (huggingface zhimeng/hico_det · Datasets at Hugging Face).
- Use one (or more) open-source or closed-source VLMs (e.g., Gemini, GPT, LLaVA, Qwen) to predict human–object interactions.
- Come up with your own prompt to guide the VLMs to predict the interactions.
Discussion: Can you identify a few failure cases where VLMs fail to prediction the HOI classes for the given images? If so, discuss the possible reasons.¶
Image 1¶
Image 2¶
Image 3¶
Image 4¶
Image 5¶
Image 6¶
I used ChatGPT 5.1 to conduct my HOI experiments. It was given the following prompt: "Your job is to look at a scene, find every human and every object, and describe what action connects them. For each interaction you identify, report the action, name the object involved, and include the human and object bounding boxes in the form [x1,y1,x2,y2]. Your entire response must be valid JSON." Even with this relatively simple prompt, GPT shows good results for detecting HOIs. The bounding boxes mostly span the object and human correctly. In Image 1, the baseball's bounding box was missed, and the catcher's bounding box was centered on his arm. Image 2 had a problem where the boxes appear warped to the right; I do not know if it was due to formatting, but no other image had this issue. Image 3 was also very good, though the verb-object bounding box could have been larger. Images 4 and 5 both seemed very accurate, with only subtle pieces outside the bounding boxes. Image 6 had no interactions, as shown.
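Since the prompt demands valid JSON, the model's reply can be parsed and sanity-checked before drawing any boxes. A hypothetical helper sketch (the field names match what my prompt requests; the validation rules themselves are my own assumptions, not part of the experiment):

```python
import json

REQUIRED_KEYS = {"action", "object", "human_bbox", "object_bbox"}

def parse_hoi_response(text):
    """Parse the VLM's JSON reply into a list of interaction dicts,
    skipping entries with missing fields or malformed boxes."""
    interactions = json.loads(text)
    valid = []
    for item in interactions:
        if not REQUIRED_KEYS <= item.keys():
            continue  # drop entries missing any required field
        h, o = item["human_bbox"], item["object_bbox"]
        if len(h) == 4 and len(o) == 4:
            valid.append(item)
    return valid
```

A check like this would surface formatting problems (such as Image 2's possible formatting issue) as dropped entries rather than silently misplaced boxes.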
Give it a try to fix the failure cases via better prompts or few-shot examples (in-context learning). Discuss if your solution works or the failure cases still cannot be solved.¶
Second Run¶
I made changes to specifically describe the output format, aiming for more consistent responses:
Your job is to look at a scene, find every human and every object, and describe what action connects them. For each interaction you identify, report the action, name the object involved, and include the human and object bounding boxes in the form [x1,y1,x2,y2]. Make sure the human BB covers the entire human. Your entire response must be valid JSON that is in the same format as the following:

```json
[
  {
    "action": "drinking",
    "object": "bottle",
    "human_bbox": [300, 60, 530, 350],
    "object_bbox": [345, 125, 430, 170]
  },
  {
    "action": "resting_near",
    "object": "bottle_on_ground",
    "human_bbox": [300, 60, 530, 350],
    "object_bbox": [150, 280, 190, 360]
  }
]
```

of this image.
This should hopefully reduce some of the inconsistency in the JSON responses, as well as fit the bounding boxes over the humans more consistently.
Image 1¶
Image 2¶
Image 3¶
Image 4¶
Image 5¶
Image 6¶
Discussion: I think this mostly improved the results, even if only slightly. That said, Images 4 and 5 performed worse under the new prompt, and I think the model struggles with people who are not standing upright. The first image did capture more of the catcher in frame, and the second and third images also had improved human bounding boxes. Image 2's boxes were fully corrected into a more accurate spot, while Image 3 had a worse object-interaction detection and its bounding box moved up. None of the changes altered the HOI interactions given, though. Image 6 also did not get any hallucinated interactions.
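One way to catch problems like Image 2's rightward-warped boxes automatically would be to clamp each returned box to the image bounds and flag any out-of-range coordinates. This is a hypothetical check I did not run in these experiments, just a sketch of the idea:

```python
def clamp_box(box, width, height):
    """Clip a [x1, y1, x2, y2] box to the image bounds. Returns the
    clipped box and a flag for whether any coordinate was out of range."""
    x1, y1, x2, y2 = box
    clipped = [max(0, min(x1, width)), max(0, min(y1, height)),
               max(0, min(x2, width)), max(0, min(y2, height))]
    return clipped, clipped != list(box)

# A box shifted off the right edge of an 800x600 image gets flagged.
box, out_of_range = clamp_box([700, 50, 900, 300], 800, 600)
```

Flagged boxes could then be re-requested from the model or inspected by hand instead of being drawn in the wrong place.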